Abstract

The pursuit of the best learning accuracy is fraught with difficulties and frustrations. Though one can optimize the empirical objective using a given set of samples, its generalization ability to the entire sample distribution remains questionable. Even if a fair generalization guarantee is offered, one still wants to know what happens if the regularizer is removed, and/or how well the artificial loss (like the hinge loss) relates to the accuracy.

For this reason, this report surveys four different trials towards learning accuracy, embracing the major advances in supervised learning theory in the past four years. Starting from the generic setting of learning, the first two trials introduce the best optimization and generalization bounds for convex learning, and the third trial gets rid of the regularizer. As an innovative attempt, the fourth trial studies the optimization when the objective is exactly the accuracy, in the special case of binary classification. This report also analyzes the last trial through experiments.

1 Introduction 

A generic learning problem can be regarded as an optimization over a parameter $w \in W$, where the objective function is given by $f(w;\theta)$ and $\theta$ is a given sample. An empirical objective $\hat{F}(w)$ can be written as

$$\hat{F}(w) = \hat{\mathbb{E}}[f(w;\theta)] = \frac{1}{m}\sum_{i=1}^{m} f(w;\theta_i) ,$$

where $\theta_1, \theta_2, \ldots, \theta_m$ are a sequence of observed samples. It is normally assumed that these samples are drawn i.i.d. from some unknown distribution $D$, and therefore the stochastic objective $F(w)$ is often more desirable:

$$F(w) = \mathbb{E}_{\theta\sim D}[f(w;\theta)] .$$

For example, if we take $f(w;\theta=(x,y)) = \max\{0, 1 - y\langle w,x\rangle\} + \frac{\lambda}{2}\|w\|^2$ we arrive at the famous SVM, with a weighted $\ell_2$-norm regularizer. The kernel trick can also be adopted, which allows a non-linear prediction: $f(w;\theta) = \max\{0, 1 - y\langle w,\phi(x)\rangle\} + \frac{\lambda}{2}\|w\|^2$.
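The empirical objective for the linear SVM example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: labels are taken to be in {-1, +1}, the regularization weight is called `lam`, and the names `Sample`, `dot`, `svm_objective` and `empirical_objective` are hypothetical helpers that do not come from any of the cited papers.

```python
from typing import Sequence, Tuple

# Hypothetical alias: a sample theta = (x, y) with label y in {-1, +1}.
Sample = Tuple[Sequence[float], int]


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product <u, v> of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def svm_objective(w: Sequence[float], sample: Sample, lam: float) -> float:
    """f(w; theta) = max{0, 1 - y<w, x>} + (lam / 2) * ||w||^2."""
    x, y = sample
    hinge = max(0.0, 1.0 - y * dot(w, x))
    return hinge + 0.5 * lam * dot(w, w)


def empirical_objective(w: Sequence[float],
                        samples: Sequence[Sample],
                        lam: float) -> float:
    """Average of f(w; theta_i) over the m observed samples."""
    return sum(svm_objective(w, s, lam) for s in samples) / len(samples)
```

The stochastic objective replaces the average over the observed samples by an expectation over $D$, which cannot be computed directly; this is why the later trials relate the two quantities.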



Trial 1. To obtain a good learning accuracy, the first trial is to optimize the empirical objective $\hat{F}(w)$. We normally assume that $f(\cdot;\theta)$ is a convex function, because we can then utilize optimization techniques such as the interior point method or gradient descent to carry out the minimization. Recent results in, for example, [DSSST10, SSSS07] have shown that in many cases stochastic gradient descent achieves the fastest running time in minimizing the empirical objective, and [ZCW+09] has extended this to kernels and (batched) parallel computation. In Section 2 we will briefly describe the main framework of [DSSST10], because it embraces all previously known first-order algorithms as special cases.

Trial 2. To better understand the learning accuracy on future samples, we need to establish the connection between the stochastic objective $F(w)$ and the empirical objective $\hat{F}(w)$. This is called generalization. A recent paper [SSSSS09] completed a thorough classification of the types of learning problems, and tells us how well each of them guarantees a good generalization error bound. The main results of this paper are provided in Section 3 as a good reference, but this is only the second trial.

As a partial summary, in the above two trials, when $f$ is convex both results have an error bound proportional to $\frac{1}{\sqrt{T}}$. In Trial 1, this $T$ is the number of iterations (which is also proportional to the running time) and the error bound is the difference between the optimal solution and the one generated by SGD. In Trial 2, this $T$ is the number of samples and the error bound is (imprecisely) the difference between the stochastic and empirical objectives. However, if we further require $f$ to be strongly convex, for instance by adding a regularizer, both bounds immediately decrease to $\frac{1}{T}$ instead of $\frac{1}{\sqrt{T}}$, and this partially explains, from a theoretical point of view, why we add a regularizer.

Trial 3. Now comes the third trial, an attempt to bound the stochastic loss instead of the stochastic objective. As explained above, we often add a regularizer (e.g. $r(w) = \frac{\lambda}{2}\|w\|^2$) to the objective function, $f(w;\theta) = \ell(w;\theta) + r(w)$. Therefore, even if we achieve a close-to-optimal solution for the stochastic objective $F(w)$, it is still far away from the stochastic loss

$$L(w) = \mathbb{E}_{\theta\sim D}[\ell(w;\theta)] .$$

To further build a connection between these two quantities, [SSS08, ZCZ+09] used a so-called oracle inequality and deduced a final bound for this stochastic loss, with respect to the running time of a program and the number of training samples $m$. This work was recognized by the best paper awards of ICML 2008 and ICDM 2009, and is briefly described in Section 4.

Trial 4. However, how well such a loss function characterizes the word "accuracy" remains a problem. One can feel free to use any convex loss function (like the hinge loss or logistic loss), but they do not reflect the accuracy at all. In the paper [SSSS10], the authors consider the following non-convex objective:

$$f(w;\theta=(x,y)) = \left| \frac{1}{2}\,\mathrm{sgn}(\langle w,\phi(x)\rangle) + \frac{1}{2} - y \right| ,$$

which is exactly the zero-one error that defines the accuracy (for binary classification) if $y \in \{0,1\}$.
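A minimal Python sketch of this zero-one objective is given here for concreteness. It assumes the identity feature map $\phi(x) = x$ and labels in {0, 1}; the helper names are hypothetical. Note that a prediction exactly on the boundary (inner product equal to zero) evaluates to 1/2 under this formula.

```python
from typing import Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product <u, v>."""
    return sum(a * b for a, b in zip(u, v))


def sgn(a: float) -> float:
    """Sign function with sgn(0) = 0."""
    if a > 0:
        return 1.0
    if a < 0:
        return -1.0
    return 0.0


def zero_one_objective(w: Sequence[float], x: Sequence[float], y: int) -> float:
    """|0.5 * sgn(<w, x>) + 0.5 - y| for a label y in {0, 1}."""
    return abs(0.5 * sgn(dot(w, x)) + 0.5 - y)


def empirical_zero_one(w: Sequence[float],
                       samples: Sequence[Tuple[Sequence[float], int]]) -> float:
    """Fraction of misclassified samples (one minus the empirical accuracy)."""
    return sum(zero_one_objective(w, x, y) for x, y in samples) / len(samples)
```

The function is piecewise constant in `w`, so its gradient is zero almost everywhere; this is the practical face of the non-convexity discussed in the text.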

Although one can still obtain a good generalization guarantee by invoking the traditional Rademacher generalization bound [BM03], the empirical optimization becomes hard (Appendix A of [SSSS10]). Realizing this difficulty, [SSSS10] studied improper learning and constructed a larger class of classifiers. Not only is the empirical optimization in the new concept class convex, but the minimizer is also good enough that the original zero-one objective problem is learnable using it. This paper was recognized by the best paper award of COLT 2010, and this report also analyzes its practical value on public data sets in Section 5.

1.1 Preliminary 

For lack of space, the definitions of strong convexity, Lipschitz continuity, dual norm and Bregman divergence are omitted from this report. Readers who are interested in the technical details may refer to (nearly) any of the references attached to this report and look into its preliminaries section.

2 Trial 1: Empirical Optimization 

We denote the empirical objective for the $t$-th sample as

$$f(w;\theta_t) = f_t(w) + r(w) ,$$

in which $r$ is a convex regularization function, and $f_t$ is a convex loss function associated with example $t$.

In the online and batch learning setting, we have a sequence of $T$ samples. At the beginning of the $t$-th round, the algorithm must make a prediction $w_t$ and then receives a function $f_t$. The ultimate goal is to minimize the following regularized regret:

$$R(T) = \max_{w^*\in W} \sum_{t=1}^{T} \left[ f_t(w_t) + r(w_t) - f_t(w^*) - r(w^*) \right] ,$$

in which $w^*$ is the optimal empirical minimizer for this sequence of samples.

The paper [DSSST10] defined the following update sequence:

$$w_{t+1} = \arg\min_{w\in W} \; \eta\langle \partial f_t(w_t), w\rangle + B_\psi(w, w_t) + \eta\, r(w) . \qquad (1)$$

Here $\eta$ is a parameter that is to be tuned later, $\partial$ denotes the sub-gradient, and $w_1$ can be set to the zero vector. The Bregman divergence associated with $\psi$ is defined as

$$B_\psi(w,v) = \psi(w) - \psi(v) - \langle \nabla\psi(v), w - v\rangle ,$$

and we require $\psi$ to be $\alpha$-strongly convex w.r.t. a norm $\|\cdot\|$, and then $B_\psi(w,v) \ge \frac{\alpha}{2}\|w-v\|^2$ for this $\alpha$. A very simple choice is to let $\psi(w) = \frac{1}{2}\|w\|_2^2$ and then $B_\psi(w,v) = \frac{1}{2}\|w-v\|_2^2$.

One of the main theorems in [DSSST10] proves that:

Theorem 1 Let $\psi$ be $\alpha$-strongly convex w.r.t. norm $\|\cdot\|$. Suppose $W$ is compact or the function $f_t$ is $G$-Lipschitz, $\|\partial f_t\|_* \le G$.$^1$ Then setting $\eta = \sqrt{2\alpha B_\psi(w^*, w_1)}/(G\sqrt{T})$,

$$R(T) \le \sqrt{2T B_\psi(w^*, w_1)}\; G/\sqrt{\alpha} .$$

Notice that this result does not use any property of the regularization term $r(w)$, and it holds even if $r(w) = 0$.

This square-root (w.r.t. $T$) regret allows one to design a stochastic gradient descent algorithm with random sampling over the training data set. However, to obtain an optimization error of $\epsilon$, one usually needs to run $T = \Omega(1/\epsilon^2)$ iterations. This is not good enough.
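To make the update sequence concrete, here is a hedged Python sketch for one special case. It assumes $\psi(w) = \frac{1}{2}\|w\|_2^2$ (so $\alpha = 1$), $W = \mathbb{R}^d$, $r(w) = \frac{\lambda}{2}\|w\|_2^2$, hinge losses with labels in {-1, +1}, and $w_1 = 0$. Under these assumptions, setting the gradient of the minimized expression to zero gives the closed form $w_{t+1} = (w_t - \eta g_t)/(1 + \eta\lambda)$, where $g_t$ is a sub-gradient of $f_t$ at $w_t$. All function names are hypothetical, and `w_star_norm` stands for an assumed bound on $\|w^*\|_2$, which is not known in practice.

```python
import math
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product <u, v>."""
    return sum(a * b for a, b in zip(u, v))


def hinge_subgradient(w: Sequence[float], x: Sequence[float], y: int) -> List[float]:
    """A sub-gradient of f_t(w) = max{0, 1 - y<w, x>} at w."""
    if y * dot(w, x) < 1.0:
        return [-y * xi for xi in x]
    return [0.0] * len(w)


def composite_step(w: Sequence[float], g: Sequence[float],
                   eta: float, lam: float) -> List[float]:
    """Closed form of the update for psi = 0.5||w||^2 and r = (lam/2)||w||^2."""
    return [(wi - eta * gi) / (1.0 + eta * lam) for wi, gi in zip(w, g)]


def theorem1_step_size(w_star_norm: float, G: float, T: int) -> float:
    """eta = sqrt(2 * alpha * B(w*, w_1)) / (G sqrt(T)) with alpha = 1, w_1 = 0."""
    bregman = 0.5 * w_star_norm ** 2
    return math.sqrt(2.0 * bregman) / (G * math.sqrt(T))


def mirror_descent(samples: Sequence[Tuple[Sequence[float], int]],
                   eta: float, lam: float = 0.0) -> List[float]:
    """Run one pass of the update over the samples, starting from w_1 = 0."""
    w = [0.0] * len(samples[0][0])
    for x, y in samples:
        w = composite_step(w, hinge_subgradient(w, x, y), eta, lam)
    return w
```

With `lam = 0.0` the step reduces to plain sub-gradient descent, which matches the remark that the bound holds even when the regularizer vanishes.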

When the objective function $f_t(w) + r(w)$ is strongly convex over $w$, w.l.o.g. we can let $r$ be $\lambda$-strongly convex and $f_t$ be merely convex. In this case we keep the update sequence of Eq. (1) but let $\eta$ vary with $t$; specifically, we let $\eta_t = \frac{1}{\lambda t}$. With only a little extra effort, one can prove the following logarithmic regret bound:

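Theorem 2 Let $r$ be $\lambda$-strongly convex w.r.t. $\psi$, and $\psi$ be $\alpha$-strongly convex w.r.t. norm $\|\cdot\|$. If the function $f_t$ is $G$-Lipschitz, $\|\partial f_t\|_* \le G$, then:

$$R(T) = O\!\left(\frac{G^2}{\lambda\alpha}\log T\right) .$$
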
Notice that if we set $\psi(w) = \frac{1}{2}\|w\|_2^2$, we immediately arrive at the online counterpart of the famous PEGASOS algorithm, which is currently the best-known linear SVM classifier [SSSS07], and arguably the best-known kernel SVM classifier, at least in the parallel setting [ZCW+09].
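For illustration, a PEGASOS-style sub-gradient loop with the decaying step size $\eta_t = 1/(\lambda t)$ might look as follows. This is a simplified sketch, not the reference implementation of [SSSS07]: it uses the plain sub-gradient step on the regularized hinge objective, omits the optional projection step, assumes labels in {-1, +1}, and processes the samples in the order given rather than sampling them at random. The function name `pegasos` and its signature are assumptions.

```python
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product <u, v>."""
    return sum(a * b for a, b in zip(u, v))


def pegasos(samples: Sequence[Tuple[Sequence[float], int]], lam: float) -> List[float]:
    """Sub-gradient descent on hinge loss + (lam/2)||w||^2 with eta_t = 1/(lam*t)."""
    w = [0.0] * len(samples[0][0])
    for t, (x, y) in enumerate(samples, start=1):
        eta = 1.0 / (lam * t)
        # Check the margin with the current iterate before shrinking it.
        violated = y * dot(w, x) < 1.0
        # Gradient step on the regularizer: w <- (1 - eta * lam) * w.
        w = [(1.0 - eta * lam) * wi for wi in w]
        if violated:
            # Sub-gradient step on the hinge loss.
            w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w
```

For the off-line problem, the same loop can be fed with samples drawn uniformly at random from a fixed training set.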

In sum, depending on the convexity of the objective function, one may adopt either the convex or the strongly convex version of the mirror descent algorithm summarized in [DSSST10], with a satisfactory theoretical bound on the regret. Regarding the off-line problem with a fixed set of training samples, as long as in each iteration a sample is chosen uniformly at random from the training set, a similar bound can be deduced just as in [SSSS07, SSS08, ZCZ+09], using Markov's inequality.

3 Trial 2: Stochastic Optimization 

As advertised in the introduction, although a sub-optimal $w$ is obtained from minimizing the empirical objective $\hat{F}(w)$, we are more interested in its generalization ability to $F(w)$.



$^1$ $\|\cdot\|_*$ is the dual norm of $\|\cdot\|$.




Figure 1: Three classes of learnable problems. Picture borrowed from [SSSSS09].



In this section we follow the convention: use $\hat{w}$ to denote the empirical minimizer, and $w^*$ the stochastic minimizer.

The paper [SSSSS09] analyzed three classes of problems (see Figure 1). The first and the smallest class is the one that guarantees uniform convergence:

$$\sup_{w\in W} \left| F(w) - \hat{F}(w) \right| \to 0 ,$$

which says that the difference between the two kinds of objectives tends to zero as $m \to \infty$. A relatively larger class is the one that is learnable by the empirical minimizer:

$$F(\hat{w}) - F(w^*) \to 0 ,$$

which guarantees point-wise convergence at the empirical minimizer. At last, the largest class of learnable functions is defined as

"there exists a rule for choosing $w$ based on samples, such that $F(w) - F(w^*) \to 0$".

All of the generalized linear problems$^2$ are included in the set of uniform convergence. These are the starred rectangle in Figure 1.

At the same time, most of the problems (satisfying convexity, Lipschitz continuity, boundedness, etc.) are learnable (the triangle in Figure 1), but not necessarily using empirical minimization, and not necessarily guaranteeing uniform convergence. A generic way to learn such functions is via online convex optimization. One may refer to Section 2 for the algorithm; its generalization guarantee is summarized in Eq. (7) and Eq. (8) of [SSSSS09].

Remark: All of the generalization error guarantees mentioned above are either of order $\frac{1}{\sqrt{m}}$ or of order $\frac{1}{m}$, depending on whether the objective function is convex or strongly convex.

$^2$ Satisfying $f(w;\theta) = g(\langle w, \phi(\theta)\rangle; \theta) + r(w)$. This includes SVM, logistic regression and all kinds of supervised learning as mentioned in the summary of [SSSSS09].

The main contribution of [SSSSS09] is that they showed all Lipschitz-continuous strongly convex problems (dotted rectangle in Figure 1) are learnable with empirical minimization. This means that if one finds the empirical minimizer $\hat{w}$, $F(\hat{w})$ and $F(w^*)$ are guaranteed to be close enough. Of course, there is also a good guarantee for any sub-optimal empirical minimizer $w$:



$$F(w) - F(w^*) \le \sqrt{\frac{2L^2}{\lambda}}\,\sqrt{\hat{F}(w) - \hat{F}(\hat{w})} + \frac{4L^2}{\delta\lambda m} ,$$

where $L$ is the Lipschitz constant of $f$, $\lambda$ is the strong-convexity constant of $f$, and $\delta$ is the confidence level.

4 Trial 3: Stochastic Loss Optimization 

Because the generalization bound differs between convex and strongly convex objectives, we usually have to add a (strongly convex) regularizer to ensure a better error bound between the empirical and stochastic objectives. For instance, we can add an $\ell_2$-norm regularizer to the hinge loss, resulting in the famous SVM problem.

If we have $f(w;\theta) = \ell(w;\theta) + r(w)$, we can define $F(w) = L(w) + r(w)$ where $L(w) = \mathbb{E}_{\theta\sim D}[\ell(w;\theta)]$. Then the quantity we are more interested in is actually

$$L(w) - L(w_0) ,$$

for a solution $w$ given by the algorithm, and the loss minimizer $w_0 = \arg\min_w L(w)$. By using the oracle inequality [SSS08, ZCZ+09]:

$$\begin{aligned} L(w) - L(w_0) &= (F(w) - F(w^*)) + (F(w^*) - F(w_0)) - r(w) + r(w_0) \\ &\le (F(w) - F(w^*)) + r(w_0) , \end{aligned}$$

one may deduce a bound on the generalized loss error given a generalized error $F(w) - F(w^*)$, while the latter is already obtained in Section 3. Though this bound looks loose (it neglects two non-positive terms), through a careful selection of the weight hidden in the regularizer $r(w)$, one may find that the practical behavior matches this theoretical bound, as shown in [SSS08, ZCZ+09].
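The oracle inequality of this section can be checked numerically on a toy problem. The sketch below is purely illustrative and rests on strong assumptions: the parameter is one-dimensional, the stochastic loss is replaced by the deterministic stand-in $L(w) = (w-1)^2$ (so $w_0 = 1$), and $r(w) = \frac{\lambda}{2}w^2$, which gives $w^* = 2/(2+\lambda)$ in closed form. All function names are hypothetical.

```python
from typing import Tuple


def loss(w: float) -> float:
    """Toy stand-in for the stochastic loss L(w); minimized at w_0 = 1."""
    return (w - 1.0) ** 2


def reg(w: float, lam: float) -> float:
    """Regularizer r(w) = (lam / 2) * w^2."""
    return 0.5 * lam * w * w


def objective(w: float, lam: float) -> float:
    """Regularized objective F(w) = L(w) + r(w)."""
    return loss(w) + reg(w, lam)


def oracle_terms(w: float, lam: float) -> Tuple[float, float, float]:
    """Return (L(w) - L(w_0), exact decomposition, upper bound)."""
    w_0 = 1.0
    w_star = 2.0 / (2.0 + lam)  # minimizer of F for this toy problem
    gap = loss(w) - loss(w_0)
    exact = ((objective(w, lam) - objective(w_star, lam))
             + (objective(w_star, lam) - objective(w_0, lam))
             - reg(w, lam) + reg(w_0, lam))
    bound = (objective(w, lam) - objective(w_star, lam)) + reg(w_0, lam)
    return gap, exact, bound
```

The two dropped terms, $F(w^*) - F(w_0)$ and $-r(w)$, are non-positive because $w^*$ minimizes $F$ and $r$ is non-negative, so `bound` can never fall below `gap`.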



5 Trial 4: Zero-One Loss



Not satisfied by the result in the previous section, [SSSS10] made an interesting attempt towards learning 0-1 objective functions. In classification problems with a half-plane classifier, the following objective function is more desirable than any other convex objective, regularized or not (e.g. hinge loss, logistic loss):

$$f(w;\theta=(x,y)) = \left| \frac{1}{2}\,\mathrm{sgn}(\langle w,\phi(x)\rangle) + \frac{1}{2} - y \right| .$$

Notice that the label $y \in \{0,1\}$.

5.1 The theory

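If we define $\varphi_{0\text{-}1}(a) = \frac{1}{2}(\mathrm{sgn}(a) + 1)$, the above objective function is characterized by the following concept class

$$H_{\varphi_{0\text{-}1}} = \{ x \mapsto \varphi_{0\text{-}1}(\langle w, \phi(x)\rangle) \} ,$$

and we are interested in optimizing the following stochastic objective:

$$F(h) = \mathbb{E}_{(x,y)\sim D}\left[ |h(x) - y| \right] , \quad h \in H_{\varphi_{0\text{-}1}} . \qquad (2)$$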



The first step of this attempt requires approximating $H_{\varphi_{0\text{-}1}}$ using Lipschitz continuous functions. Define

$$\varphi_{\mathrm{sig}}(a) = \frac{1}{1 + \exp(-4La)} ,$$

which is $L$-Lipschitz continuous and approximates $\varphi_{0\text{-}1}$ well.$^3$ Now consider the following concept class for the stochastic objective (Eq. (2)):

$$H_{\varphi_{\mathrm{sig}}} = \{ x \mapsto \varphi_{\mathrm{sig}}(\langle w, \phi(x)\rangle) \} .$$
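The two transfer functions can be compared numerically with a short Python sketch. The function names are hypothetical, and the sigmoid is written in a numerically stable form so that large values of $|La|$ do not overflow. The slope of the sigmoid at the origin is $4L \cdot \frac{1}{4} = L$, which is where its Lipschitz constant is attained.

```python
import math


def phi_01(a: float) -> float:
    """Zero-one transfer function 0.5 * (sgn(a) + 1), with sgn(0) = 0."""
    if a > 0:
        return 1.0
    if a < 0:
        return 0.0
    return 0.5


def phi_sig(a: float, L: float) -> float:
    """Sigmoid transfer function 1 / (1 + exp(-4 L a)), evaluated stably."""
    z = 4.0 * L * a
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # z < 0, so exp(z) cannot overflow
    return e / (1.0 + e)


def slope_at_zero(L: float, h: float = 1e-6) -> float:
    """Central finite-difference estimate of the derivative of phi_sig at 0."""
    return (phi_sig(h, L) - phi_sig(-h, L)) / (2.0 * h)
```

A larger $L$ makes `phi_sig` track `phi_01` more closely away from the origin, at the price of a larger Lipschitz constant.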

One advantage of such an approximation is that it allows theorems like the Rademacher generalization bound [BM03] to hold. Indeed, the empirical minimizer,

$$\hat{w} = \arg\min_w \hat{F}(w) = \arg\min_w \frac{1}{m}\sum_{i=1}^{m} \left| \varphi_{\mathrm{sig}}(\langle w, \phi(x_i)\rangle) - y_i \right| ,$$

gives a generalization error bound $F(\hat{w}) - \hat{F}(\hat{w}) \le \epsilon$ when $m = \Omega(L^2/\epsilon^2)$. However, it has been pointed out that the empirical minimization is "hard" since the objective is not convex [SSSS10].

To overcome this difficulty, a new concept class is introduced:

$$H_B = \{ x \mapsto \langle w, \psi(x)\rangle : \|w\|^2 \le B \} ,$$

and its difference from $H_{\varphi_{0\text{-}1}}$ or $H_{\varphi_{\mathrm{sig}}}$ is twofold. First, it no longer uses a 0-1 function in the prediction; the traditional half-plane classification using an inner product is adopted. Second, it enables a new kernel $\psi$, which is defined as$^4$:

$$\langle \psi(x), \psi(x')\rangle = \frac{1}{1 - \nu\langle \phi(x), \phi(x')\rangle} .$$



Choosing $B = \Omega(\exp(L\log(L/\epsilon)))$ large enough, [SSSS10] proved that $H_B$ approximately includes $H_{\varphi_{\mathrm{sig}}}$, and therefore we can directly study the learning problem in $H_B$. This only requires the Lipschitz continuity of $\varphi_{\mathrm{sig}}$ and the Chebyshev approximation technique, and is a very general proof.

$^3$ In [SSSS10] they also analyzed two other approximating functions, but for lack of space they are omitted here.

$^4$ $\nu$ can be chosen to be $1/2$ for ease of presentation.
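The kernel used by the new concept class is easy to evaluate directly. The sketch below assumes the identity base feature map $\phi(x) = x$, inputs of Euclidean norm at most 1 (so that $\nu\langle x, x'\rangle < 1$), and $\nu = 1/2$ by default; the function names are hypothetical. It also compares the closed form with a truncated geometric series $\sum_n (\nu\langle x, x'\rangle)^n$, which equals the closed form whenever $|\nu\langle x, x'\rangle| < 1$ and hints at why the kernel corresponds to polynomial features of every degree.

```python
from typing import Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product <u, v>."""
    return sum(a * b for a, b in zip(u, v))


def kernel(x: Sequence[float], x_prime: Sequence[float], nu: float = 0.5) -> float:
    """<psi(x), psi(x')> = 1 / (1 - nu <x, x'>), assuming phi is the identity."""
    t = nu * dot(x, x_prime)
    if t >= 1.0:
        raise ValueError("nu * <x, x'> must be below 1; rescale the inputs")
    return 1.0 / (1.0 - t)


def kernel_series(x: Sequence[float], x_prime: Sequence[float],
                  nu: float = 0.5, terms: int = 60) -> float:
    """Truncated geometric series sum_{n < terms} (nu <x, x'>)^n."""
    t = nu * dot(x, x_prime)
    return sum(t ** n for n in range(terms))
```

Because only kernel evaluations are needed, the learning problem in $H_B$ can be handled by any kernelized convex solver without constructing $\psi$ explicitly.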

One big benefit of this conversion is that the new problem is convex and can be empirically optimized via, for instance, the stochastic gradient descent mentioned in Section 2. Note that due to the boundedness of $H_B$, using the Rademacher complexity bound again [BM03, KST08], a sample complexity of $\Omega(B/\epsilon^2)$ can be deduced.

The procedure above is improper learning: to learn $H_{\varphi_{0\text{-}1}}$ we actually employ a larger concept class $H_B$, and a classifier $h \in H_B$ will be returned which is close to the optimal classifier in $H_{\varphi_{0\text{-}1}}$. Furthermore, the overall time and sample complexity is $\mathrm{poly}(\exp(L\log(L/\epsilon)))$. This bound is exponential w.r.t. $L$, but [SSSS10] also showed that a polynomial dependency on $L$ is impossible under a cryptographic hardness assumption.

5.2 The experiment 

Although the complexity depends crucially on $L$, when $L$ is a constant (e.g. 1) there is still reason to believe that the approximated 0-1 loss function using $\varphi_{\mathrm{sig}}$ is more accurate than the hinge loss. However, even if $L$ is constant, the sample complexity $m = \Omega(B/\epsilon^2) = \Omega(1/\epsilon^3)$ has a cubic dependency on $1/\epsilon$. At the same time, the kernel stochastic gradient descent algorithm (in the distributed manner) requires a time complexity of $O(m^2)$. So there comes a dilemma: we can neither set $m$ too large, since otherwise the empirical optimization cannot finish in acceptable time, nor set $m$ too small, since otherwise the generalization guarantee is not good enough.
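A back-of-the-envelope calculation makes the dilemma tangible. The sketch below takes $L = 1$ and sets every constant hidden by the $\Omega(\cdot)$ and $O(\cdot)$ notation to 1, which is an assumption made only for illustration; the resulting numbers indicate orders of magnitude, not actual requirements. The function names are hypothetical.

```python
def required_samples(eps: float) -> int:
    """Sample size m ~ 1 / eps^3 for L = 1, with hidden constants set to 1."""
    return round(1.0 / eps ** 3)


def kernel_sgd_cost(m: int) -> int:
    """Kernel SGD time ~ m^2 kernel evaluations, with hidden constants set to 1."""
    return m * m


if __name__ == "__main__":
    for eps in (0.1, 0.05, 0.01):
        m = required_samples(eps)
        print(f"eps={eps}: m ~ {m}, kernel SGD cost ~ {kernel_sgd_cost(m)}")
```

Under these simplifications, halving $\epsilon$ multiplies the sample size by 8 and the optimization cost by 64.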

Consider some $m$ of medium size, just enough to be trained in several minutes, for example. Although the zero-one loss does not have a good generalization guarantee, it is a more desirable function than the hinge loss. So given a training set of medium size, if we run SVM against a zero-one loss minimizer, who is to win this tug-of-war?

5.2.1 Configuration 

In the interest of fairness, the same stochastic gradient descent routine, P-packSVM [ZCW+09], has been adopted for the experiment, on an Intel Quad CPU machine with four cores (4 times speed-up). Regarding the empirical optimization for the zero-one loss, two different approaches are implemented: ZeroOne refers to the classical mirror descent in Theorem 1, and ZeroOne-reg refers to the regularized zero-one loss objective with the strongly convex mirror descent in Theorem 2. Lastly, PEGASOS refers to the empirical optimizer for the regularized SVM algorithm.

The three datasets Splice, Web and Adult presented by the libSVM project team [Fan] are used.





Dataset   # Training / Testing Samples   # Features
Splice    1000 / 2175                    60
Web       2477 / 47272                   300
Adult     1605 / 30956                   123



5.2.2 Results 

From Table 1 one can see that when $m$ is on the order of 1000, it is still unclear which method outperforms which. For the Adult dataset, the traditional SVM trainer PEGASOS significantly beats the zero-one loss optimizers, while on Splice and Web, the zero-one loss does slightly better.

Table 1: The accuracy report for three different methods on three different datasets, with the best-tuned parameters listed. The program has been run 5 times and the mean accuracy is reported. The number of iterations is 100000 and all programs run in 30 seconds.

                    PEGASOS                     ZeroOne                     ZeroOne-reg
                    Accuracy  rbf     λ         Accuracy  rbf     λ         Accuracy  rbf     λ
Splice (Gaussian)   0.90069   0.02    0.0003    0.903448  0.01    0.08      0.902989  0.01    0.0006
Splice (Linear)     0.846897  -       0.0006    0.885517  -       0.01      0.877701  -       0.0003
Adult (Gaussian)    0.844004  0.025   0.0003    0.84016   0.0125  0.003     0.840742  0.0125  0.0002
Adult (Linear)      0.842971  -       0.0003    0.838674  -       0.02      0.838448  -       0.0003
Web (Gaussian)      0.980094  0.0125  0.00003   0.980729  0.0125  0.001     0.981236  0.0125  0.0001
Web (Linear)        0.976667  -       0.0003    0.981152  -       0.003     0.980411  -       0.0003

Another aspect worth examining is the convergence rate of the three methods. From Figure 2 one can see that because ZeroOne uses a learning rate of $1/\sqrt{T}$, where $T$ is the total number of iterations, its accuracy curve is smoother than those of PEGASOS and ZeroOne-reg, which use a learning rate of $1/T$ for strong convexity. However, because the generalization guarantee for the 0-1 loss is weaker than that of SVM, ZeroOne-reg converges much more slowly than PEGASOS.

Notice that Web is a highly biased dataset: 97 percent of the samples are negative. This explains why all methods seem to perform with great turbulence on this dataset.



Parallel primal gradient descent kernel svm. In 
ICDM, pages 677-686, 2009. 
[ZCZ+09] Zeyuan Allen Zhu, Weizhu Chen, Chenguang Zhu, Gang Wang, Haixun Wang, and Zheng Chen. Inverse time dependency in convex regularized learning. In ICDM, pages 667-676, 2009.
Figure 2: The variation of the accuracy as the number of training iterations increases.



